I.  INTRODUCTION


For  three  decades  interest  in  simulation  modeling  and  simulation  languages  has 
been  expanding,  almost  keeping  pace  with  the  phenomenal  rate  of  growth  of  computer 
technology.  Lagging  somewhat  behind  has  been  the  concern  for  the  validation  of  the 
resulting  simulation  models;  that  is,  the  establishment  of  some  level  of  confidence  that 
the  model  does,  in  fact,  accurately  mimic  some  real-world  process.  In  the  last  fifteen 
years,  research  in  validation  techniques  has  been  substantially  increased;  and  a 
consensus  of  general  conclusions  has  formed: 

1.  validation  is  problem  dependent  -  there  is  no  one  general  validation 
technique,  mainly  because  the  output  from  a  model  may  be  independent  or 
correlated,  univariate  or  multivariate,  stationary  or  dynamic,  and  so  forth; 
in  fact,  the  model  itself  may  be  deterministic  or  stochastic, 

2.  in  general,  absolute  validity  is  nonexistent  -  once  a  particular  technique  has 
been  established,  the  model  is  usually  validated  only  for  a  specific  purpose 
and  over  a  specific  range  of  values, 

3.  empirical  data  are  necessary  -  in  order  to  validate  a  model,  some  comparison 
of  output  data  with  real-world  data  must  be  made;  furthermore,  these 
empirical  data  must  be  independent  of  those  used  in  construction  of  the 
model,  and 

4.  statistical  tests  are  desirable  -  of  the  many  methods  proposed  for  validating 
simulation  models,  the  use  of  statistical  tests  seems  to  be  preferred,  possibly 
because  of  the  ability  to  establish  some  level  of  confidence. 

Because  computer  simulation  models  are  prevalent  at  the  Ballistic  Research 
Laboratory,  the  Experimental  Design  and  Analysis  Branch  of  the  Systems  Engineering 
and  Concepts  Analysis  Division  was  funded  to  perform  research  in  the  area  of  the 
validation  of  such  models.  Results  from  the  research  are  summarized  in  this  report. 
They  include  a  thorough  literature  review  in  which  we  examined  existing  validation 
techniques  along  with  additional  related  information.  Eventually  we  developed  two 
nonparametric  procedures,  demonstrating  them  on  a  simulation  model  currently  used  by 
the  Vulnerability /Lethality  Division. 

Nonparametric  validation  methods  generally  involve  a  procedure  known  as 
hypothesis testing. The initial step is to state a null hypothesis, usually "the simulation
model is valid." Then a level of confidence is established, often 95%; and a particular
test statistic is chosen. Two kinds of error are possible in hypothesis testing. The first
is  called  a  Type  I  error  and  occurs  when  a  true  null  hypothesis  is  rejected.  If  the  level 
of  confidence  has  been  set  at  95%,  then  it  follows  that  the  probability  of  a  Type  I  error 
is  5%.  However,  in  simulation  model  validation  a  Type  II  error  is  the  more  important 
to  control;  this  occurs  when  a  false  null  hypothesis  is  accepted.  No  level  of  confidence  is 
pre-established  to  guard  against  accepting  an  invalid  model;  but,  for  any  particular 
statistical  test,  a  measure  of  the  protection  against  this  error  is  given  by  the  power  of 
the  test,  equal  to  the  probability  of  rejecting  the  null  hypothesis  when  it  is  false. 
Unfortunately, there is a tradeoff between the two error types; as the level of
confidence  is  increased  (lower  probability  of  a  Type  I  error),  the  power  of  the  test  is 
decreased  (higher  probability  of  a  Type  II  error).  This  implies  that  one  way  to  increase 
the  power  of  a  test  is  to  decrease  the  level  of  confidence  in  it.  There  are,  however,  more 
satisfactory  ways;  and  they  will  be  mentioned  in  the  summary  of  this  report.  The 
important  point  to  remember  is  that  when  attempting  to  validate  a  simulation  model 
using  hypothesis  testing,  it  is  imperative  that  the  statistical  test  be  a  powerful  one. 
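The tradeoff between confidence level and power can be illustrated numerically. The sketch below is not one of the procedures developed in this report; it assumes a one-sided z-test with known unit variance, chosen only because its power has a closed form. The function name `z_test_power` and the shift and sample-size values are hypothetical.

```python
from statistics import NormalDist


def z_test_power(alpha, shift, n):
    """Power of a one-sided z-test (known unit variance) against a mean shift.

    alpha is the Type I error probability, so the level of confidence is 1 - alpha.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha)  # rejection cutoff for the z statistic
    return 1.0 - std_normal.cdf(critical - shift * n ** 0.5)


# Raising the level of confidence (lowering alpha) lowers the power,
# while a larger sample raises the power at a fixed level of confidence.
for alpha in (0.10, 0.05, 0.01):
    print(f"confidence {1 - alpha:.0%}, n=20: power = {z_test_power(alpha, 0.5, 20):.3f}")
print(f"confidence 95%, n=40: power = {z_test_power(0.05, 0.5, 40):.3f}")
```

With no shift at all, the computed power equals alpha, which is simply the probability of a Type I error.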

II.  LITERATURE  REVIEW

As  the  electronic  computer  became  a  more  powerful  tool,  computer  simulation 
became  a  more  viable  method  by  which  the  behavior  of  a  given  process  could  be 
characterized.  As  early  as  the  1950’s,  articles  were  being  published  about  computer 
modeling  of  entire  systems;  and  soon  after,  specialized  simulation  languages  were 
developed.  The  pioneers  in  this  field  realized  the  need  for  some  assurance  that  the 
simulation  output  would  be  consistent  with  the  empirical  data  that  were  available. 
However,  prior  to  1967  there  was  very  little  written  that  provided  any  explicit 
procedures  which  might  be  applied  to  determine  the  soundness  of  a  computer  model.  In 
that  year  several  papers  concerning  this  problem  were  published,  and  two  of  them 
became  a  foundation  upon  which  most  subsequent  efforts  have  been  constructed. 

In 1967, Fishman and Kiviat1 provided definitions which differentiated the notions
of  verification  and  validation,  terms  which  had  previously  been  used  interchangeably. 
"Verification  determines  whether  a  model  with  a  particular  mathematical  structure  and 
data  base  actually  behaves  as  an  experimenter  assumes  it  does.  Validation  tests 
whether  a  simulation  model  reasonably  approximates  a  real  system."  Most  individuals 
working  in  this  area  today  have  subscribed  to  these  definitions,  although  papers 
continue  to  be  published  which  do  not  discriminate  between  the  two  ideas.  Figure  1, 
taken from a paper by Winter, et al.2, is a Venn diagram illustrating the relationship
between verification, validation, and other concepts within the field of computer
simulation. Stone3 believed the word assessment "... is preferable to validation which
has a ring of excessive confidence about it." However, in this paper we will continue to
consider validation as defined by Van Horn,4 who expanded on the previous definition
by giving it a somewhat statistical flavor. "Validation ... is the process of building an
acceptable level of confidence that an inference about a simulated process is a correct or
valid inference for the actual process."


1 Fishman, G.S. and Kiviat, P.J., "Digital Computer Simulation: Statistical Considerations," Memorandum RM-5387-PR, The
Rand Corporation, 1967.

2 Winter, E.M., Wisemiller, D.P., and Ujihara, J.K., "Verification and Validation of Engineering Simulations with Minimal Data,"
Proceedings of the 1976 Summer Computer Simulation Conference, 1976.

3 Stone, M., "Cross-Validating Choice and Assessment of Statistical
Prediction," Journal of the Royal Statistical Society, Series B, 36, 1974.

4 Van Horn, R., "Validation," The Design of Computer Simulation Experiments, Duke University Press, 1969.


FIGURE 1: RELATIONSHIPS BETWEEN THE VARIOUS CONCEPTS OF A COMPUTER SIMULATION


The  second  influential  paper  to  appear  in  1967  was  by  Naylor  and  Finger.5  In  it 
they  proposed  a  three-stage  approach  to  validation  of  a  computer  simulation.  This 
technique,  or  a  modified  version  of  it,  has  been  used  by  numerous  authors.  Law6  has 
augmented their approach with specific suggestions for each of the three stages:

1. develop high face-validity - ensure that the simulation seems reasonable to
those  people  who  are  knowledgeable  in  the  area, 

2.  test  the  simulation  assumptions  -  examine  the  data  used  in  building  the 
simulation  and  empirically  test  the  assumptions  drawn  from  those  data,  and 

3.  compare  simulation  output  data  with  empirical  data  -  use  tests,  statistical  if 
possible,  to  determine  a  level  of  confidence  in  the  simulation. 

When  attempting  to  validate  existing  models,  the  first  two  stages  will  often  have 
already  been  completed  by  the  developer  of  the  simulation  leaving  only  the  third  stage, 
potentially  the  most  difficult. 


5 Naylor, T.H. and Finger, J.M., "Verification of Computer Simulation Models," Management Science, Vol. 14, No. 2, 1967.

6 Law, A.M., Simulation Modeling and Analysis, University of Wisconsin, 1979.


Not everyone subscribes to the three-stage approach to validation. However, there
does seem to be general agreement that the third stage, comparing simulation output
data with empirical data, is crucial. Sometimes obtaining empirical data in the region
of applicability is very difficult, especially in engineering simulations. Winter, et al.2
mention in that case, "The quality of the component models and the excellent
knowledge of the random process along with a systematic verification must be a
substitute for validation." However, Fishman and Kiviat1 are firm in their statement
that "... if no numerical data exist for an actual system, it is not possible to establish
the quantitative congruence of a model with reality." In attempting to perform this
third stage, Wright7 suggests that three questions be considered:

1.  how  do  we  intelligently  compare  simulation  output  data  with  empirical  data, 

2.  how  do  we  collect  and  exploit  the  empirical  data  used  in  our  tests,  and 

3.  how  do  we  transform  the  results  of  these  tests  into  a  confidence  in  the 
computer  simulation? 

Finally, Baird, et al.8 warn that the empirical data used for comparison with the
simulation  output  data  must  be  independent  of  those  used  in  building  the  computer 
model;  otherwise,  we  have  only  verification  of  the  simulation. 

Tytula9 has divided the many methods used for the data comparison into five
general  categories: 

1.  judgemental  comparison  -  this  method  seems  to  be  the  most  widely  used  and 
includes  graphical  analysis  and  the  comparison  of  common  properties  such  as 
the  mean  and  variance;  it  is  easy  to  use  and  quite  practical,  but  the  impact 
of  errors  in  judgement  is  difficult  to  assess, 

2.  hypothesis  testing  -  this  method  includes  goodness-of-fit  tests,  analysis-of- 
variance  techniques,  and  nonparametric  ranking  methods;  since  this  will  be 
the  category  of  interest  in  our  report,  the  advantages  and  disadvantages  will 
be  discussed  in  the  succeeding  section, 

3. spectral analysis - since the output of many simulation models is in the form
of a time series, this method is particularly useful; however, it is difficult to
relate the invalidity at a particular frequency to the overall simulation
validity,


7 Wright, R.D., "Validating Dynamic Models: An Evaluation of Tests of Predictive Power."

8 Baird, A.M., Goldman, R.B., Bryan, W.C., Holt, W.C., and Stuart, F.M., "Verification and Validation of RF Environmental
Models - Methodology Overview," Boeing Aerospace Company, 1980.

9 Tytula, T.P., "A Method for Validating Missile System Simulation Models," Technical Report E-78-11, US Army Missile
Research and Development Command, 1978.


4.  sensitivity  analysis  -  this  method  can  determine  a  range  of  parameter  values 
and  assumptions  over  which  the  simulation  is  valid,  but  it  is  usually  difficult 
to  analyze  the  effects  of  the  characteristics  drifting  outside  this  range,  and 

5.  indices  of  performance  -  this  method  is  useful  in  ranking  models;  however,  it 
is  impossible  to  pick  a  value  for  a  given  index  which  will  always  imply  a 
valid  simulation. 

Validation is a difficult process because, as Tytula9 points out, no single
satisfactory  method  exists.  Most  techniques  are  problem  dependent;  and,  indeed,  the 
output  data  of  a  simulation  may  be  independent  or  correlated,  univariate  or 
multivariate,  stationary  or  dynamic.  In  fact,  Garrett10  states  that,  "The  critical 
dimension  affecting  the  applicability  of  various  techniques  is  that  of  the  deterministic  or 
stochastic  nature  of  the  output.”  Only  a  few  authors  have  attempted  to  provide  a 
general  validation  technique  -  see  Gilmour11  for  an  example.  Most  have  developed 
methods  which  apply  to  a  select  subset  of  simulation  models;  and,  even  then,  the 
simulation  is  often  validated  only  for  a  particular  purpose  or  over  a  particular  range  of 
values. In that case, care must be taken not to apply the simulation model outside the
validated region.

III.  VALIDATION  PROCEDURES

In  this  report  we  will  be  examining  hypothesis  testing  as  a  method  for  validating 
both  deterministic  and  stochastic  computer  simulation  models.  This  type  of  procedure 
allows  some  level  of  confidence  to  be  attached  to  the  results.  When  employing 
hypothesis  testing,  several  assumptions  must  usually  be  stated;  but  by  using 
nonparametric  ranking  techniques  we  will  eliminate  one  major  (and  often  unjustifiable) 
assumption  -  that  the  data  arise  from  a  normal  distribution. 

Sargent12  notes  that  for  hypothesis  testing  we  generally  assume  a  null  hypothesis 
that  the  simulation  model  is  valid.  Then  by  establishing  a  level  of  confidence  for  a 
particular  statistical  test,  we  fix  the  probability  of  a  Type  I  error  in  which  we  reject  a 
valid  model.  However,  for  simulation  validation  it  is  more  important  to  minimize  the 
probability  of  a  Type  II  error,  that  is,  accepting  an  invalid  model.  The  magnitude  of 
the Type II error can be determined by the power function of the statistical test where
the  power  is  the  probability  of  rejecting  a  false  null  hypothesis.  For  a  fixed  sample  size 
there  is  a  tradeoff  between  the  two  error  types,  so  that  we  can  increase  the  power  at  the 
expense  of  the  confidence  level.  Unfortunately,  the  power  can  not  be  computed  against 


10 Garrett, M., "Statistical Validation of Simulation Models," Proceedings of the 1974 Summer Computer Simulation Conference.

11 Gilmour, P., "A General Validation Procedure for Computer Simulation Models," The Australian Computer Journal, Vol. 5,
No. 3, 1973.

an alternative hypothesis as general as, "The simulation model is invalid"; and
therefore,  it  must  be  examined  against  an  array  of  different  specific  alternative 
hypotheses.  Nevertheless,  we  continue  to  search  for  powerful  statistical  tests  with 
justifiable  assumptions  which  will  still  provide  acceptable  levels  of  confidence. 

Let $X = (x_1, x_2, \ldots, x_k)$ be a vector of inputs to a simulation model, and let y be an
output  resulting  from  X.  Then  y  may  take  on  a  single  value,  as  in  a  deterministic 
model,  or  many  values,  as  is  the  case  with  a  stochastic  model.  Let  z  be  the 
corresponding  value  from  the  real-world  process  given  the  same  input  vector.  In 
general,  y  will  not  be  equal  to  z  since  X  contains  only  a  finite  number  of  input  variables; 
ostensibly, the most relevant ones. The purpose of the simulation model is to mimic the
real-world process. Thus, in attempting to validate it, we compare each empirical value
with the corresponding model output generated under the same conditions; that is, the
same values for the vector X.

Suppose there exist N pairs of data $(y_1, z_1), (y_2, z_2), \ldots, (y_N, z_N)$ available for
comparison, where each pair corresponds to a different input vector and where each $y_i$
may itself be a vector of values in the case of a stochastic model. Reynolds and
Deaton13  note  that  because  each  of  the  pairs  was  generated  under  different  conditions,  it 
would  be  incorrect  to  pool  the  data  and  proceed  with  the  testing  of  our  hypothesis. 
Rather,  we  must  find  a  statistical  procedure  which  examines  each  pair  individually  and 
then  allows  for  the  combination  of  these  results  into  one  overall  test  that  provides 
reasonable power. With this as our goal, we propose to use two nonparametric
statistical  procedures  -  the  Wilcoxon  signed-ranks  test  in  the  case  of  a  deterministic 
model  and,  for  a  stochastic  model,  a  process  which  combines  independent  cases  of  the 
Mann-Whitney  test. 

Deterministic  Model 

A  deterministic  model  provides  one  and  only  one  set  of  output  values  for  each  set 
of  input  values.  Such  a  model  is  frequently  used  as  a  first  attempt  at  representing  a 
stochastic system, and quite often it will adequately simulate at least the coarse
behavior  of  such  a  system.  The  deterministic  model  generally  has  the  advantages  of 
being  both  simple  and  inexpensive.  Any  individual  output  value  y  from  the  model  can 
be  compared  with  an  empirical  value  z  obtained  from  the  actual  system  under  the  same 
set  of  input  values.  Considering  N  different  input  sets,  the  available  data  consist  of  N 
observations $(y_1, z_1), (y_2, z_2), \ldots, (y_N, z_N)$ of bivariate random variables. The Wilcoxon
signed-ranks test is applicable. The null hypothesis of this test can be loosely stated as,
"The values of the $y_i$'s tend to be the same as the values of the $z_i$'s," which we can
interpret as, "The simulation model is valid."
The Wilcoxon signed-ranks test is a hypothesis test for identical medians that uses
paired observations. To use it, we first compute $D_i = y_i - z_i$ for i = 1, 2, ..., N,
recalling that each of these random variables may be from a different distribution. The
following four assumptions are made concerning these $D_i$'s:

1. the distribution of each $D_i$ is symmetric,

2. the $D_i$'s are mutually independent,

3. the $D_i$'s all have the same median, call it $m_{50}$, and

4. the measurement scale of the $D_i$'s is at least interval.

The  fourth  assumption  means  that  for  any  two  observations  on  the  random 
variable  we  can  distinguish  not  only  which  is  larger  and  which  is  smaller,  but  also  which 
is  farther  from  the  common  median. 

The null hypothesis is that $m_{50} = 0$; in other words, that all the $D_i$'s have medians
equal to zero. This would indicate that the values of the $y_i$'s and the $z_i$'s tend to be the
same. A rank $R_i$, based on the absolute value of each $D_i$, is assigned; thus, the $R_i$'s
consist of the integers 1 to N. $R_i$ is then adjusted to zero for each $D_i < 0$. The nonzero
integers that remain are the ranks of the positive $D_i$'s; and a test statistic T is
defined to be their sum; that is, $T = \sum_i R_i$. Very high and very low values of T cause
rejection of the null hypothesis. The theory behind the test is explained very clearly by
Conover14, where tables containing various quantiles of the Wilcoxon signed-ranks test
statistic are available.
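The computation of T can be sketched in a few lines. This is a minimal illustration that assumes no zero differences and no tied absolute differences; the function name and the five data pairs are hypothetical and are not taken from this report.

```python
def signed_rank_statistic(y, z):
    """Wilcoxon signed-ranks statistic T for paired model outputs y and empirical values z.

    Assumes no zero differences and no ties among the absolute differences.
    """
    diffs = [yi - zi for yi, zi in zip(y, z)]
    # Rank the absolute differences from 1 (smallest) to N (largest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    # Ranks belonging to negative differences are set to zero; T sums the rest.
    return sum(r for r, d in zip(ranks, diffs) if d > 0)


# Hypothetical data: differences are 2, -5, 4, -1, 7, with ranks 2, 4, 3, 1, 5.
model_output = [12, 7, 15, 9, 20]
empirical = [10, 12, 11, 10, 13]
print(signed_rank_statistic(model_output, empirical))  # T = 2 + 3 + 5 = 10
```

For N = 5 the statistic ranges from 0 to 15, so a value of 10 is not extreme; the decision itself would come from the tabulated quantiles.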

One further assumption is sometimes made, that each $D_i$ is a continuous random
variable. Theoretically, this assures that there will be no $D_i = 0$ and no $D_i = D_j$ where
$i \neq j$. However, in practice the available data may produce zeros and ties; and methods
have been devised for handling these situations. Although it is often recommended that
the zeros be dropped from the data immediately, they are sometimes very important,
especially when attempting to show that there is no significant difference between the
values of the $y_i$'s and $z_i$'s. Lehmann15 proposes ranking the absolute values of all the
$D_i$'s including the zeros and, in the case of ties, assigning each of the tied values the
average of the ranks normally due them. Then the $R_i$'s are adjusted by multiplying
them by -1 if $D_i < 0$, 0 if $D_i = 0$, or 1 if $D_i > 0$. The test statistic $T_1$ then becomes
the sum of the positive $R_i$'s, and a second test statistic $T_2$ is defined as the sum of the
absolute value of the negative $R_i$'s. Rejection of the null hypothesis is caused by very
high values of either $T_1$ or $T_2$.
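The treatment of zeros and ties described above can be sketched as follows. This is one reading of the ranking rule as summarized in this section, not a reproduction of Lehmann's tables or critical values; the function name and the six data pairs are hypothetical.

```python
def signed_rank_statistics_with_ties(y, z):
    """Return (T1, T2): midrank sums for the positive and the negative differences.

    Zeros are kept in the ranking, and tied absolute differences share the
    average of the ranks they would otherwise occupy.
    """
    diffs = [yi - zi for yi, zi in zip(y, z)]
    abs_sorted = sorted(abs(d) for d in diffs)

    def midrank(value):
        first = abs_sorted.index(value) + 1   # first 1-based position of the tied group
        count = abs_sorted.count(value)       # size of the tied group
        return first + (count - 1) / 2.0      # average of the positions occupied

    t1 = sum(midrank(abs(d)) for d in diffs if d > 0)
    t2 = sum(midrank(abs(d)) for d in diffs if d < 0)
    return t1, t2


# Hypothetical data: differences are 0, -2, 2, 0, -3, 2.
# Midranks of |D|: the two zeros get 1.5, the three 2's get 4, and the 3 gets 6.
t1, t2 = signed_rank_statistics_with_ties([5, 3, 8, 6, 4, 9], [5, 5, 6, 6, 7, 7])
print(t1, t2)  # 8.0 10.0
```

The zeros contribute to neither statistic, but they still occupy the lowest ranks, which is how they influence the result.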


14 Conover, W.J., Practical Nonparametric Statistics, John Wiley & Sons, Inc., 1971.

15 Lehmann, E.L., Nonparametrics: Statistical Methods Based on Ranks, Holden-Day, Inc., 1975.


As  mentioned  earlier,  a  misuse  of  hypothesis  testing  as  a  method  of  simulation 
validation  occurs  when  too  little  concern  is  shown  for  the  power  of  the  test.  The  power 
is  the  probability  of  rejecting  an  invalid  model,  and  we  would  like  this  probability  to  be 
as  close  to  one  as  possible.  Unfortunately,  the  power  can  be  calculated  only  for  specific 
alternative  hypotheses.  In  order  to  generate  power  curves  for  the  Wilcoxon  signed-ranks 
test, it is convenient to make the additional assumption that all $D_i$'s come from a
common distribution. Although this may not always be valid, it does afford us an
indication of the power of the test against an alternative consisting of a shift in the
mean, which for a symmetric distribution is identical to the median. Figure 2 shows
some power curves for this test against a shift in the mean when the underlying
distribution of the $D_i$'s is normal with a mean equal to $\mu$ and a variance equal to one.
Recall that a true null hypothesis would indicate that the values of the $y_i$'s and the $z_i$'s
tend  to  be  equal.  These  curves  were  generated  using  a  Monte-Carlo  procedure  which 
incorporated  10,000  replications.  Note  the  increase  in  power  as  the  number  of 
observations  increases.  Figures  3-5  display  some  power  curves  for  other  alternative 
hypotheses, each figure assuming a different common distribution for the $D_i$'s with a
corresponding  modification  of  one  of  the  parameters  of  the  distribution.  Notice  when 
the  abscissa  is  equal  to  zero  (when  the  null  hypothesis  is  true),  the  probability  of 
rejection  is  0.05  -  the  value  chosen  for  the  probability  of  a  Type  I  error.  The  faster  the 
curve  approaches  one,  the  more  powerful  the  test  against  that  particular  alternative 
hypothesis.  Although  very  narrow  in  their  scope,  these  results  do  provide  us  with  an 
indication  of  the  overall  power  of  the  test  against  a  shift  in  location  and  allow  us  to 
determine the extent to which the probability of a Type II error might be reduced by an
increase  in  sample  size. 
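A Monte-Carlo power estimate of the kind described above might be organized as in the following sketch. It is not the code used to generate the figures: the exact null distribution is computed here in place of the published tables, the rejection region is the largest symmetric two-sided region whose size does not exceed 5%, and all function names are hypothetical. Because T is discrete, the estimated rejection rate at a zero shift will fall slightly below 0.05.

```python
import random


def null_distribution(n):
    """Exact null probabilities P(T = t), t = 0..n(n+1)/2, assuming no zeros or ties."""
    max_t = n * (n + 1) // 2
    counts = [0] * (max_t + 1)
    counts[0] = 1
    # Under the null hypothesis each rank 1..n enters T with probability 1/2,
    # so counts[t] is the number of subsets of {1, ..., n} whose sum is t.
    for rank in range(1, n + 1):
        for t in range(max_t, rank - 1, -1):
            counts[t] += counts[t - rank]
    return [c / 2 ** n for c in counts]


def critical_values(n, alpha=0.05):
    """Two-sided cutoffs (lower, upper): reject when T <= lower or T >= upper."""
    probs = null_distribution(n)
    lower, tail = -1, 0.0
    while tail + probs[lower + 1] <= alpha / 2:
        lower += 1
        tail += probs[lower]
    return lower, len(probs) - 1 - lower


def signed_rank_statistic(diffs):
    """Sum of the ranks of |D_i| that belong to positive D_i."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)


def estimate_power(n, shift, replications=10000, alpha=0.05, seed=0):
    """Fraction of simulated samples, D_i ~ Normal(shift, 1), that reject the null."""
    rng = random.Random(seed)
    lower, upper = critical_values(n, alpha)
    rejections = 0
    for _ in range(replications):
        diffs = [rng.gauss(shift, 1.0) for _ in range(n)]
        t = signed_rank_statistic(diffs)
        if t <= lower or t >= upper:
            rejections += 1
    return rejections / replications


if __name__ == "__main__":
    for mu in (0.0, 0.5, 1.0, 1.5):
        print(f"N=10, mu={mu}: estimated power = {estimate_power(10, mu):.3f}")
```

Repeating the loop for several values of N would trace out a family of curves comparable in form to those in Figure 2.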

Stochastic  Model 

A  stochastic  model  provides  a  set  of  output  values  that,  for  each  given  set  of  input 
values,  occurs  with  a  certain  probability.  Mihram16  states  that  this  ”...  probability  ... 
serves  as  a  measure  of  our  human  ignorance  of  the  actual  situation  and  its 
implications.”  Generally,  the  behavior  of  the  system  is  too  complicated  to  include  all  of 
the  appropriate  inputs  in  the  computer  model.  Even  if  it  were  possible,  the  return  in 
accuracy  provided  by  such  thoroughness  may  be  small.  Refinement  of  a  computer 
model  usually  leads  to  stochastic  modeling;  and  because  of  the  abilities  of  today’s 
computers,  the  use  of  such  modeling  has  substantially  increased. 

Given M replications, output of the model becomes a set of values $y^1, y^2, \ldots, y^M$ for
each  set  of  input  values  which  can  be  compared  with  (in  our  case)  a  single  corresponding 
empirical  value  z.  Recall  that  X  is  a  vector  of  most,  but  not  all,  of  the  relevant  input 
variables.  Then  z,  given  the  value  of  X,  is  a  random  variable  reflecting  the  random 
error  due  to  the  exclusion  of  certain  factors  from  X.  Also  y,  of  course,  is  a  random 
variable since the simulation model is stochastic. We would like to show that $F(y \mid X)$,
the conditional distribution function of y, is equal to $G(z \mid X)$, the conditional
distribution function of z, for all $-\infty < y, z < \infty$ and for all X.

FIGURE 5: POWER OF 5%-LEVEL TEST
H0: F=LOGISTIC(0,1) VS. H1: F=LOGISTIC(a≠0,1)

Considering N different input sets, the data consist of N observations
$(y_1^1, y_1^2, \ldots, y_1^M, z_1), \ldots, (y_N^1, y_N^2, \ldots, y_N^M, z_N)$ of multivariate
random variables, where the $y^k$'s for any given observation share a common distribution.
Mihram16 suggests ranking $y_i^1, y_i^2, \ldots, y_i^M, z_i$ for each i; if the model is valid, we would
expect the $z_i$ to fall somewhere in the middle of such a ranking. This is the initial step
in a procedure known as the Mann-Whitney test, a particular case in which one of the
random variables, namely $z_i$, has a sample size of one. Since we are dealing with N
observations, we need a method by which we can combine independent cases of the
Mann-Whitney test; such a method has been proposed by Van Elteren17 and referenced
in a very clear example by Reynolds, et al.18

The  Mann-Whitney  test  is  a  hypothesis  test  involving  samples  from  two 
distributions  that  tests  for  equality  of  the  distributions.  For  each  input  set  X  a  sample 
of M output sets $y^1, y^2, \ldots, y^M$ is obtained from the computer simulation, and the
empirical observation z provides another sample of size one. The following three
assumptions  are  made: 

1)  both  samples  are  random  samples  from  their  respective  populations, 

2)  in  addition  to  independence  within  each  sample,  there  is  mutual 
independence  between  the  two  samples,  and 

3) the measurement scale is at least ordinal.

The  third  assumption  means  that  for  any  two  observations  on  the  random  variable  we 
can  distinguish  which  is  larger  and  which  is  smaller. 

The null hypothesis is that $F(y \mid X) = G(z \mid X)$ for a given input set X. When we
combine N of these tests, in the manner suggested by Van Elteren, we have the null
hypothesis of $F(y \mid X) = G(z \mid X)$ for all $-\infty < y, z < \infty$ and for all X, which we can
interpret as, "The simulation model is valid." Let $R_i$ be the rank of $z_i$ in the ith
observation $(y_i^1, y_i^2, \ldots, y_i^M, z_i)$; thus, $R_i$ is an integer between 1 and M + 1. Then a
test statistic T is defined as the sum of the $R_i$'s over all N observations; that is,
$T = \sum_i R_i$. Very high or very low values of T will cause rejection of the null hypothesis.
The theory behind the Mann-Whitney test is given in Conover14, and the combination
of such tests is explained by Van Elteren17.
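The statistic T for the stochastic case can be sketched as below. Under the null hypothesis with continuous distributions, each $R_i$ is equally likely to be any integer from 1 to M + 1, so T has mean N(M + 2)/2 and variance NM(M + 2)/12. Using a normal approximation with these moments to obtain a two-sided p-value is an assumption made here for illustration; the report does not state how its critical values were obtained, and the approximation is rough for small N. The function names and data are hypothetical.

```python
from statistics import NormalDist


def rank_of_empirical(y_reps, z):
    """Rank of z within the pooled sample (y^1, ..., y^M, z), using midranks for ties."""
    below = sum(1 for y in y_reps if y < z)
    equal = sum(1 for y in y_reps if y == z)
    return below + 1 + equal / 2.0


def combined_rank_test(observations):
    """observations: list of (y_reps, z) pairs, every y_reps having the same length M.

    Returns (T, z_score, p_value) with a two-sided normal-approximation p-value.
    """
    n = len(observations)
    m = len(observations[0][0])
    t = sum(rank_of_empirical(y_reps, z) for y_reps, z in observations)
    mean = n * (m + 2) / 2.0            # N times the mean of a uniform rank on 1..M+1
    variance = n * m * (m + 2) / 12.0   # N times the variance of that uniform rank
    z_score = (t - mean) / variance ** 0.5
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_score)))
    return t, z_score, p_value


# Hypothetical data: three input sets, M = 4 replications each.
observations = [
    ([1.0, 2.0, 3.0, 4.0], 2.5),   # z ranks 3rd of 5
    ([1.0, 2.0, 3.0, 4.0], 0.5),   # z ranks 1st of 5
    ([1.0, 2.0, 3.0, 4.0], 4.5),   # z ranks 5th of 5
]
t, z_score, p_value = combined_rank_test(observations)
print(t, z_score, p_value)
```

In this example the empirical values fall low, middle, and high in their respective rankings, so T equals its null mean and the test gives no evidence against the model.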

A  fourth  assumption  is  usually  made,  that  both  samples  consist  of  random 
variables  from  continuous  distributions.  As  in  the  case  of  the  Wilcoxon  test  statistic, 
this  is  to  assure  that  there  will  be  no  zeros  and,  more  importantly,  no  ties.  However,  for 

17 Van Elteren, P., "On the Combination of Independent Two Sample Tests of Wilcoxon,"
Bulletin de l'Institut International de Statistique, 37, 1960.
this test, a moderate number of ties is tolerable; and they are handled as previously by
assigning each of the tied values the average of the ranks normally due them.

The  power  of  this  test  against  alternative  hypotheses  analogous  to  those  shown  for 
the  Wilcoxon  test  is  displayed  in  Figures  6-9  which  were  generated  using  a  Monte-Carlo 
procedure  which  incorporated  2,000  replications.  Once  again,  in  generating  these  power 
curves,  we  have  made  one  additional,  albeit  restrictive,  assumption;  namely,  the 
distribution of the $y_i$'s is the same for each vector of input values, and similarly for the
distribution of the $z_i$'s. Although it would be preferable to avoid this assumption, it is
necessary in order to test against specific alternative hypotheses - in this case, a shift in
the mean; and, as with the Wilcoxon test, these curves do provide an indication of the
overall  power  of  this  combination  of  Mann-Whitney  tests  against  the  shift  in  location. 
This  test  appears  slightly  less  powerful  than  the  Wilcoxon  signed-ranks  test.  This  is  a 
result  of  the  assumption  of  the  less  stringent  ordinal  measurement  scale.  If  M  =  1,  the 
combined  Mann-Whitney  test  reduces  to  the  sign  test,  a  nonparametric  procedure 
similar  to  the  Wilcoxon  test  but  making  no  assumption  of  symmetry  of  the  distributions 
and  consequently  requiring  only  an  ordinal  measurement  scale,  resulting  in  a  less 
powerful  test.  Reynolds  and  Deaton13  look  at  some  test  statistics  similar  to  T  designed 
to  be  more  powerful  against  other  alternative  hypotheses. 


IV.  EXAMPLE 

The  Vulnerability  Analysis  for  Surface  Targets  (VAST)  model  is  a  computer 
simulation  currently  in  use  at  the  Ballistic  Research  Laboratory  to  evaluate  the  effect  of 
kinetic  energy  projectiles  or  shaped-charge  threats  against  a  single  surface  target.19  It 
incorporates  damage  from  both  the  primary  penetrator  and  any  associated  spall 
fragments;  but  currently  it  is  unable  to  handle  damage  resulting  from  blast,  heat,  and 
certain  synergistic  effects  such  as  ricochets.  Furthermore,  there  is  a  variety  of  opinions, 
estimates,  and  decisions,  all  based  on  the  experience  of  the  vulnerability  analysts  but 
generally  providing  vague  and  imprecise  data,  which  subsequently  serve  as  input  to  the 
simulation.  Nevertheless,  results  demonstrate  reasonable  face  validity,  so  an  attempt  at 
statistical  validation  of  the  model  seems  feasible. 

A  target  description  is  produced  by  a  separate  computer  code  using  a  combination 
of  geometric  figures  and,  once  generated,  can  be  viewed  from  any  orientation.  After  a 
viewing  angle  has  been  established,  a  rectangular  grid  is  superimposed  over  the  target  in 
the  plane  orthogonal  to  that  angle.  From  a  (uniform)  randomly-selected  point  within 
each  grid  cell,  a  ray  is  traced  through  the  target;  and  a  list  is  constructed  of  all 
components  encountered.  If  a  spall-producing  component  is  encountered,  spall  rays  are 
traced  from  that  point  of  impact  to  all  critical  components  in  the  target.  These  rays 
represent  spall  fragments  whose  size,  shape,  and  velocity  are  chosen  at  random  from 
specified  distributions. 
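
The grid-sampling step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the VAST code: the function name, its parameters, and the grid dimensions are all hypothetical, and the ray tracing itself is omitted.

```python
import random


def sample_ray_origins(x_min, y_min, cell_w, cell_h, n_cols, n_rows, rng):
    """Return one uniformly distributed point per grid cell as (col, row, x, y)."""
    origins = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Lower-left corner of this cell in the plane orthogonal to the view.
            x0 = x_min + col * cell_w
            y0 = y_min + row * cell_h
            # A uniform draw inside the cell gives the origin of the traced ray.
            x = x0 + rng.random() * cell_w
            y = y0 + rng.random() * cell_h
            origins.append((col, row, x, y))
    return origins


# Hypothetical 4 x 3 grid of 0.5 x 0.25 cells; the seed is fixed for repeatability.
rng = random.Random(1)
origins = sample_ray_origins(0.0, 0.0, 0.5, 0.25, n_cols=4, n_rows=3, rng=rng)
```

Each returned point would serve as the origin of one primary ray; a fresh draw per replication is what makes repeated runs differ.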


19. Hafer, T. F. and Hafer, A. S., "Vulnerability Analysis for Surface Targets (VAST): An Internal Point-Burst Vulnerability Model,"
ARBRL-TR-0215f, US Army Ballistic Research Laboratory, 1979.
FIGURE 8: POWER OF 5%-LEVEL TEST

Along each individual ray, residual masses and velocities of the primary penetrator
and  associated  spall  fragments  are  used  to  calculate  the  probability  of  incapacitation  for 
each  critical  component.  These  are  then  combined  over  all  critical  components  and 
provide  a  loss  of  function  (LOF)  for  the  particular  cell,  further  combined  over  all  cells  to 
provide  a  LOF  for  the  particular  orientation,  and  finally  combined  over  several 
orientations to provide an overall LOF for the target. Although its input is stochastic
in nature, the VAST model is generally run with just one replication because the results
are  fairly  consistent  from  replication  to  replication  and  because  the  model  requires 
considerable  time  and,  hence,  expense  to  execute. 

Data  were  provided  by  vulnerability  assessors  who  had  estimated  loss  of  function 
for  a  particular  surface  target  based  on  their  inspection  of  actual  damage  from  a 
particular  round  of  ammunition  -  in  this  case,  the  function  evaluated  was  the  mobility 
function. When attempting to compare model output with these empirical data, it was
first necessary to determine the exact point of impact on the surface target during the
live-fire exercise. Then the VAST model assumed that point of impact to be the origin
of the ray representing the primary penetrator. Damage due to that ray and its
associated spall rays was then combined to provide a loss of function value which could
be  compared  with  the  empirical  datum  point.  Therefore,  only  one  orientation  was 
considered  and,  for  that  particular  orientation,  a  ray  originating  at  a  specific  point 
within  only  one  cell  was  examined.  Encountering  a  spall-producing  component  still 
required  a  random  selection  of  spall  characteristics;  and  because  execution  time  was 
reduced,  the  model  was  run  using  thirty  replications  -  the  output  data  appear  in  Table 
1. The averaged results were compared with the empirical data, in the manner
proposed for deterministic simulations; individual outputs from the thirty
replications were also compared with the empirical data, this time using the method
proposed  for  stochastic  simulations.  Thus,  these  data  provided  examples  for  both  of  our 
proposed  validation  procedures. 

Results  of  the  test  for  the  deterministic  form  of  the  model  appear  in  Table  2. 
Under  the  null  hypothesis  of  a  valid  model,  the  sum  of  the  positive  ranks  should  equal 
the sum of the absolute values of the negative ranks; that is, T1 = T2. Lehmann10
shows  how  to  establish  critical  values  against  which  the  test  statistic  can  be  evaluated. 
He  derives  the  expectation  of  the  test  statistic, 

$$ E[T] = \frac{1}{4} \left[ N(N+1) - d_0(d_0+1) \right], \tag{1} $$

and the variance of the test statistic,

$$ \mathrm{Var}[T] = \frac{1}{24} \left[ N(N+1)(2N+1) - d_0(d_0+1)(2d_0+1) \right] - \frac{1}{48} \sum_{i=1}^{n} d_i(d_i-1)(d_i+1), \tag{2} $$

where T is either the sum of the positive ranks or the sum of the absolute values of the
negative ranks, N is the number of observations, d_0 is the number of zero differences,
and d_i represents the number of tied differences in the ith tie, with n different ties.
Appealing to the central-limit theorem, T* = (T - E[T]) / √Var[T] tends to the
standard normal distribution as the number of non-zero differences tends to infinity.
For our example we have 32 observations, eight zero differences, and one tie with two
tied differences; therefore, E[T] = 246 and Var[T] ≈ 2809. We can calculate critical
values by evaluating the equation T = 53z + 246, where z is the α/2 (lower) or
1 - α/2 (upper) quantile of the standard normal distribution. As shown at the bottom of
Table 2, even at an α-level of 0.10 there is no basis for rejecting the null hypothesis.

TABLE 2 (bottom rows): MOBILITY KILL

Σ Positive Ranks = 327
Σ |Negative Ranks| = 165

Critical T-Values (α = 0.05) = 142 (lower), 350 (upper)
Critical T-Values (α = 0.10) = 158 (lower), 334 (upper)
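
The arithmetic behind these critical values can be reproduced with a short Python sketch. It assumes the standard form of the signed-ranks moments with zero and tied differences and a normal approximation; the function names and the choice to round the bounds outward to integers are assumptions made here, not details taken from the report.

```python
import math
from statistics import NormalDist


def signed_rank_moments(n_obs, n_zero, tie_sizes=()):
    """Mean and variance of the signed-ranks statistic with zeros and ties."""
    mean = (n_obs * (n_obs + 1) - n_zero * (n_zero + 1)) / 4.0
    var = (n_obs * (n_obs + 1) * (2 * n_obs + 1)
           - n_zero * (n_zero + 1) * (2 * n_zero + 1)) / 24.0
    # Each group of d tied non-zero differences lowers the variance.
    var -= sum(d * (d - 1) * (d + 1) for d in tie_sizes) / 48.0
    return mean, var


def critical_values(mean, var, alpha):
    """Two-sided normal-approximation bounds, rounded outward (an assumption)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(var)
    return math.floor(mean - half_width), math.ceil(mean + half_width)


# Values from the example: 32 observations, 8 zero differences, one tie of size 2.
mean, var = signed_rank_moments(n_obs=32, n_zero=8, tie_sizes=[2])
bounds_05 = critical_values(mean, var, alpha=0.05)
bounds_10 = critical_values(mean, var, alpha=0.10)
print(mean, var, bounds_05, bounds_10)
```

With the example's inputs the exact variance is 2808.875, which the text rounds to 2809 (standard deviation 53), and the observed sum of positive ranks, 327, lies inside both intervals.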

Table 3 contains the results for the stochastic model. Recall that R_i is the rank of
z_i in the ith observation (y_i1, y_i2, ..., y_iM, z_i), and T is defined as the sum of the R_i's.
Under the null hypothesis of a valid model, z_i has the same distribution as
y_i1, y_i2, ..., y_iM; and therefore, R_i is uniformly distributed over the values
1, 2, ..., M + 1. Modifying the results of Lehmann15 by incorporating the number of
observations, we can calculate the expectation of the test statistic,

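$$ E[T] = \frac{1}{2} \left[ N(M+2) \right], \tag{3} $$

and the variance of the test statistic,

$$ \mathrm{Var}[T] = \frac{1}{12} \left[ N M (M+2) \right] - \frac{1}{12(M+1)} \left[ \sum_{i=1}^{N} \sum_{j=1}^{n_i} \left( d_{ij}^3 - d_{ij} \right) \right], \tag{4} $$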

where N is the number of observations, M is the number of replications of the model,
and d_ij represents the number of tied values for the jth tie in the ith observation, with n_i
different ties in the ith observation. Then T* = (T - E[T]) / √Var[T] will have
approximately a standard normal distribution. For our example we have 32
observations, 30 replications, and 51 instances of tied values with varying numbers of
ties; in this case E[T] = 512 and Var[T] = 1521. We can again calculate critical
values, this time by evaluating the equation T = 39z + 512, where z is the α/2 (lower) or
1 - α/2 (upper) quantile of the standard normal distribution. As shown at the bottom of
Table 3, there is insufficient evidence to reject the null hypothesis at an α-level of 0.05;
however, at an α-level of 0.10, the null hypothesis would be rejected.
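
A minimal Python sketch of the stochastic-case statistic follows. It assumes midranks are used when values tie and that the tie correction takes the usual Mann-Whitney form for a sample of M model outputs and one empirical value; the function names and the toy data are hypothetical and are not the Table 3 data.

```python
def midrank_of_last(values):
    """Midrank of the final element (the empirical value z_i) among all values."""
    z = values[-1]
    below = sum(1 for v in values if v < z)
    equal = sum(1 for v in values if v == z)  # counts z itself
    return below + (equal + 1) / 2.0


def tie_group_sizes(values):
    """Sizes of the groups of equal values within one observation."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return [c for c in counts.values() if c > 1]


def combined_rank_moments(n_obs, m, ties_by_obs=()):
    """Mean and variance of T = sum of R_i, each R_i a rank among m + 1 values."""
    mean = n_obs * (m + 2) / 2.0
    var = n_obs * m * (m + 2) / 12.0
    correction = sum(d ** 3 - d for ties in ties_by_obs for d in ties)
    var -= correction / (12.0 * (m + 1))
    return mean, var


# Toy data: each row is (y_i1, y_i2, y_i3, z_i) with M = 3 replications.
observations = [
    [0.2, 0.5, 0.9, 0.4],
    [0.1, 0.1, 0.3, 0.1],
]
t_stat = sum(midrank_of_last(obs) for obs in observations)
toy_mean, toy_var = combined_rank_moments(
    2, 3, [tie_group_sizes(obs) for obs in observations])

# The report's dimensions: N = 32, M = 30, here without any tie correction.
report_mean, untied_var = combined_rank_moments(32, 30)
print(t_stat, toy_mean, toy_var, report_mean, untied_var)
```

For N = 32 and M = 30 the untied variance is 2560, so the reported value of 1521 reflects a sizeable reduction due to ties.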

Since in neither case could the null hypothesis be rejected at an α-level of 0.05, we
must  be  concerned  with  the  possibility  of  a  Type  II  error;  that  is,  accepting  an  invalid 
model.  Figures  2-9  demonstrate  the  power  of  these  tests  against  an  alternative 
consisting  of  a  shift  in  the  mean.  Consider  the  deterministic  case.  Referring  to  Figure 
3,  we  see  that  if  F  (the  distribution  of  the  differences  between  the  model  output  and  the 
empirical  data)  is  uniform,  then  the  power  of  this  test  is  very  good  since  the  probability 
of  rejection  rises  quickly  as  the  parameter  increases  in  value.  Conversely,  Figure  4 
demonstrates  that  if  F  is  Cauchy,  then  the  power  of  the  test  is  rather  poor.  Results  for 
the  stochastic  case  are  analogous.  Figure  7  shows  that  the  power  of  this  test  is  very 
good  if  F  (the  distribution  of  the  model  output)  and  G  (the  distribution  of  the  empirical 
data)  are  both  uniform.  However,  as  seen  in  Figure  8,  if  F  and  G  are  both  Cauchy,  then 
the  power  of  the  test  is  again  rather  poor. 

Reynolds  and  Deaton13  have  proposed  other  test  statistics  more  powerful  against 
different  alternatives;  but  for  the  loss  of  function  data  where  empirical  results  that  are 
close  to  the  value  one  tend  to  be  assigned  that  value,  a  shift  in  the  mean  seems  to  be  an 
appropriate alternative hypothesis. Since the power against this particular alternative is
fairly  good  overall,  our  confidence  in  the  hypothesis  tests  tends  to  increase.  However, 
we  would  like  to  be  able  to  make  these  tests  and  other  tests  still  more  powerful  and,  in 
the  future,  will  be  exploring  methods  to  accomplish  this. 


V.  SUMMARY 

When  referring  to  computer  simulation  models,  a  few  authors  continue  to  use  the 
words  verification  and  validation  interchangeably;  however,  most  distinguish  between 
the  two  terms.  Verification  of  a  computer  model  assures  that  the  simulation  is  behaving 
as  the  modeler  intends,  while  validation  assures  that  the  simulation  is  behaving  as  the 
real  world  does.  Verification  is  the  process  of  debugging  a  computer  program;  validation 
is  making  it  consistent  with  reality. 

Prior  to  1967  very  little  was  written  concerning  the  validation  of  simulations;  but 
much  has  appeared  since  then,  and  there  has  been  general  agreement  on  several  points  - 
the  most  important  being  that  to  validate  a  computer  simulation  model,  empirical 
observations  are  necessary  and  statistical  tests  are  desirable.  All  validation  techniques 
can  be  placed  into  one  of  five  categories:  judgemental  comparisons,  hypothesis  testing, 
spectral  analysis,  sensitivity  analysis,  and  indices  of  performance. 

Nonparametric  ranking  techniques  are  one  class  of  statistical  hypothesis  tests.  We 
have  advocated  the  Wilcoxon  signed-ranks  test  as  a  validation  procedure  for 
deterministic  simulation  models  and  a  combination  of  independent  Mann-Whitney  tests 
as  a  validation  procedure  for  stochastic  simulation  models.  They  are  statistical  tests 
which  assess  empirical  data  to  provide  a  certain  level  of  confidence  in  the  computer 
model.  The  main  disadvantage  of  both  is  the  same  as  that  of  all  hypothesis  testing 
techniques;  namely,  their  concern  for  protecting  against  Type  I  errors,  sometimes  at  the 
expense  of  Type  II  errors.  A  Type  I  error  results  in  rejecting  a  valid  simulation  model  - 
unfortunate, but not as potentially dangerous as accepting an invalid simulation model,
which  is  known  as  a  Type  II  error.  For  any  particular  test  we  can  get  an  indication  of 
the  probability  of  a  Type  II  error  by  generating  a  series  of  curves  that  will  allow  us  to 
examine  the  power  of  the  test  against  various  alternatives. 

Power  is  defined  as  the  probability  of  rejecting  a  false  null  hypothesis,  and  we 
would  like  this  value  to  be  as  close  to  one  as  possible.  For  our  advocated  tests  we  have 
evaluated  the  power  for  some  specific  alternative  hypotheses  by  incorporating  a  Monte- 
Carlo  procedure  into  a  computer  program,  which  allowed  us  to  perform  thousands  of 
replications.  Each  replication  represents  a  case  in  which  the  alternative  hypothesis  was 
true,  and  we  determined  whether  or  not  the  test  rejected  the  null  hypothesis. 
Obviously, we cannot compute power against an alternative hypothesis as general as,
"The simulation model is invalid." However, in being more specific we are forced to
examine  an  array  of  different  alternative  hypotheses;  and  while  a  test  may  be  powerful 
against  a  subset  of  these  alternatives  (such  as  a  shift  in  the  mean  of  a  distribution),  it 
might  be  less  so  against  others.  The  most  we  can  hope  for  is  reasonable  power  against 
alternatives  important  to  a  particular  investigation.  Both  the  Wilcoxon  signed-ranks 
test and the combination of independent Mann-Whitney tests appear to have reasonable
power  against  a  shift  in  the  mean,  but  we  would  like  to  be  able  to  increase  it. 
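
The Monte-Carlo procedure described above can be sketched as follows for the signed-ranks test. This is an illustration under stated assumptions rather than the program used for the report: the differences are drawn from continuous distributions (so zeros and ties are ignored), the alternative is a pure location shift, and the function names, shift size, and replication count are hypothetical.

```python
import math
import random
from statistics import NormalDist


def positive_rank_sum(diffs):
    """Sum of the ranks of |d| that belong to positive differences."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)


def rejects(diffs, z_crit):
    """Two-sided signed-ranks test using the normal approximation."""
    n = len(diffs)
    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0
    return abs((positive_rank_sum(diffs) - mean) / math.sqrt(var)) > z_crit


def estimate_power(shift, n_obs, draw, alpha=0.05, n_reps=2000, seed=0):
    """Fraction of replications in which the shifted sample is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(n_reps):
        diffs = [draw(rng) + shift for _ in range(n_obs)]
        hits += rejects(diffs, z_crit)
    return hits / n_reps


def uniform(rng):
    return rng.uniform(-0.5, 0.5)


def cauchy(rng):
    # Inverse-CDF draw from a standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))


size_uniform = estimate_power(0.0, 32, uniform)   # no shift: near alpha
power_uniform = estimate_power(0.3, 32, uniform)
power_cauchy = estimate_power(0.3, 32, cauchy)
print(size_uniform, power_uniform, power_cauchy)
```

Repeating the call over a range of shift values and sample sizes would trace out power curves of the kind discussed in the text, with the heavy-tailed Cauchy case rising much more slowly than the uniform case.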

For  any  given  alternative  hypothesis  there  are  several  ways  of  increasing  the  power. 
One  such  way  can  be  seen  in  Figures  2-9  -  increasing  the  number  of  observations. 
Another  way  is  to  reduce  the  level  of  confidence  in  the  test  itself;  that  is,  allow  the 
probability  of  a  Type  I  error  to  increase.  In  the  future  we  will  be  investigating  other 
methods  for  increasing  the  power  of  statistical  hypothesis  tests  in  general  and  of  the  two 
we  have  advocated  in  particular.  These  methods  will  include  a  statistical  procedure 
known  as  bootstrapping,  a  mathematical  theory  known  as  fuzzy  sets,  and,  possibly,  a 
combination of the two. Because of the importance of computer simulation
validation, we hope to develop ways to make these tests more powerful against a wide
range  of  alternatives  while  still  permitting  them  to  provide  acceptable  levels  of 
confidence  in  their  results. 